ABSTRACT

We present a catalogue of about 6 million unresolved photometric detections in the
Sloan Digital Sky Survey Seventh Data Release, classifying them into stars, galaxies
and quasars. We use a machine learning classifier trained on a subset of
spectroscopically confirmed objects from 14th to 22nd magnitude in the SDSS i-band. Our
catalogue consists of 2,430,625 quasars, 3,544,036 stars and 63,586 unresolved
galaxies from 14th to 24th magnitude in the SDSS i-band. Our algorithm recovers 99.96%
of spectroscopically confirmed quasars and 99.51% of stars to i ~ 21.3 in the colour
window that we study. The level of contamination due to data artefacts for objects
beyond i = 21.3 is highly uncertain, and all mentions of completeness and
contamination in the paper are valid only for objects brighter than this magnitude. However,
a comparison of the predicted number of quasars with the theoretical number counts
shows reasonable agreement.
1 INTRODUCTION

There has been a surge in the number of large astronomical surveys trying to map
the deep sky in terms of its constituents, their number density and evolution since
the early epochs. The Sloan Digital Sky Survey (SDSS; York et al. 2000) is one such
survey, covering about a quarter of the sky and providing photometry in five optical
bands for ~357 million objects in its seventh and final SDSS-II data release (DR7;
Abazajian et al. 2009). As less than one per cent of these have been
spectroscopically observed as a part of the survey, the exact nature of the
overwhelming majority of objects in the survey remains unconfirmed. This situation
will also prevail with future large surveys, where the gap between imaging and
spectroscopy is only expected to widen. It is, therefore, necessary to develop
techniques to identify objects reliably on the basis of their photometric data,
using the colours of spectroscopically identified objects as a guide. In this paper,
we describe the use of a machine learning classifier to classify unresolved objects
from DR7 into three categories (quasars, stars and galaxies), using a sample of
spectroscopically confirmed objects for training the classifier.

The SDSS images objects in five bands with its imaging camera (Gunn et al. 1998,
2006), allowing one to view each object in four independent colours. The colours can
be used to identify spectral classes of the objects. Stoughton et al. (2002)
describe this in detail using the SDSS early data release of spectroscopically
confirmed objects. The same method is used for the SDSS spectroscopic quasar
candidate selection pipeline described by Richards et al. (2002). This candidate
selection procedure, along with follow-up spectroscopy, has been used to produce a
catalogue of over 100,000 spectroscopically confirmed quasars for SDSS DR7, making
it the largest available spectroscopic quasar catalogue. SDSS has undergone several
improvements in its photometric quality (Adelman-McCarthy et al. 2008) and other
researchers have also made use of colour as a selection tool for preparing
catalogues of different objects of interest. These include photometric
identification of galaxies (Oyaizu et al. 2008), quasars (Richards et al. 2004,
2009a), stars (Covey et al. 2007) and photometric redshift (Niemack et al. 2009;
Richards et al. 2009b) with reliable accuracies.

The colours of quasars and stars are sufficiently distinct for them to be reasonably
well separated in a two-colour diagram, say U-B against B-V, but there is
significant overlap between the two classes and a more reliable separation requires
a larger number of colours. Fig. 1 shows a typical colour distribution (u-g against
g-r) of spectroscopically confirmed unresolved objects in the SDSS. It is apparent
that the relative number of quasars, stars and galaxies varies widely over different
regions of the colour plane. This is a reflection of the fact that objects with
intrinsically different spectra occupy different regions of the multi-colour space.
However, there is substantial overlap of different types. The black box encloses the
region of u-g and g-r colour space containing the highest density of the SDSS
confirmed quasars. The region has almost equal numbers of low redshift quasars
(z ≲ 2.3) and stars, with a marginal population of late type stars and unresolved
galaxies. About 84 per cent of SDSS quasars are within this colour window, which
also has ~46 per cent of all spectroscopic unresolved detections in the SDSS
archive. The lower panel of Fig. 1 compares the redshift of all known quasars (blue)
with the redshift of quasars that fall within the black box (black) to show that
this region is complete in redshift up to ~2.6.

A recent study by Richards et al. (2009b) used colours obtained by combining optical
data from the SDSS with mid-infrared data to develop an eight dimensional
classification algorithm, based on Bayesian kernel density estimation, for the
robust selection of quasar candidates. This method leads to a twenty fold increase
in the surface density of quasar candidates, with estimated 97 per cent completeness
and 10 per cent contamination, over spectroscopically confirmed quasars in the SDSS.
The photometric catalogue (Richards+ 2009 catalogue) thus produced has identified
over a million likely quasars to an order of magnitude deeper than the SDSS
spectroscopic limit.

We describe a different approach, which uses only SDSS colours and machine learning
techniques (see Section 3) to provide a large photometric catalogue of unresolved
objects, with a probability assigned to each object for being a quasar, a star or an
unresolved galaxy. This allows candidates of a particular kind with a high level of
completeness and low contamination to be chosen from the catalogue using an
appropriate probability cutoff. We produce a catalogue of objects extending to
fainter magnitudes, up to the SDSS photometric detection limits. For this, we first
use a subset of the SDSS spectroscopically confirmed objects covering the region
shown as the black box in the upper panel of Fig. 1 to train our machine learning
algorithm (hereafter called the classifier) and then use the trained classifier to
test the prediction accuracy on all objects in the same region that have their
identity confirmed by the SDSS spectroscopy. We observe that at brighter magnitudes,
where the SDSS spectra are available, both stars and quasars can be photometrically
separated with a contamination of less than 1 per cent. We also extend our analysis
to other surveys where spectroscopic classification is available. Finally, we make a
prediction on the possible class of objects at fainter magnitudes where no
spectroscopic confirmation is available. The contamination at those faint magnitudes
is uncertain, and it will be an acid test for the validity of the proposed method
when newer surveys going to fainter magnitudes confirm the actual class of those
objects. We also briefly discuss how machine learning tools can be used to identify
outliers and errors in the data.

Figure 1. Different spectral classes of spectroscopically confirmed unresolved
objects in u-g versus g-r colour-colour space are shown in the upper panel. The
black box shows the 2D projection of the region used in the present study. Although
it might appear to be dominated by low redshift quasars (blue), this region also has
a large number of main sequence stars (green). The SDSS spectroscopic target
selection algorithm efficiently filtered out many stars, which is why they appear in
relatively small numbers in the plot. In addition to these, the region is occupied
by high redshift quasars (red), a few late type stars (orange) and galaxies (pink).
About 84 per cent of all known SDSS and 2dF quasars (blue) are within this colour
window (black), covering the full redshift range up to z ~ 2.6 as shown in the lower
panel.

The organization of the paper is as follows: Section 2 briefly describes the dataset
and the colour selection criteria. The classifier is described in Section 3. The
construction of training and test data, test results and their analysis are
presented in Section 4. The catalogue, its format, the results of cross-matching,
quasar number density and completeness are described and presented in Section 5.
Finally, we summarise results in Section 6.
2 THE DATA - SDSS DR7

The DR7 catalogue contains spectra of 930,000 galaxies, 120,000 quasars and 460,000
stars (DR7; Abazajian et al. 2009). The SDSS has five filters, namely u, g, r, i and
z, that give spectral coverage from ~3900 Å to 9100 Å (Fukugita et al. 1996) with an
exposure time of 54.1 seconds per band. Images are taken with a large mosaic CCD
camera in drift scan mode (Gunn et al. 1998). We used the 'SpecPhoto' view derived
from the 'SpecPhotoAll' table of the SDSS Catalogue Archive Server (CAS) for the
training part of this study. 'SpecPhoto' consists of the SDSS spectroscopically
confirmed detections with clean spectra. The spectral classification in this table
is labelled 'SpecClass' and is given numerical labels ranging from 1 to 6 to
represent the different spectroscopic types.

The objects used in this study are unresolved point sources which have SDSS i-band
point spread function (psf) magnitude ranging from 14 to 24 and which occupy the
colour window region defined by the colour cuts in Table 1. This region has 106,466
unresolved spectroscopically classified objects that we used for training and
testing our classifier. This population contains similar numbers of stars and
quasars, with a small number of faint unresolved galaxies scattered over the region.
The data for the photometric catalogue that we describe later were also taken from
the same region. This gave us three groups of data. The first two groups, namely the
training and testing data, have spectroscopic confirmation of their identity. The
training data are used to adjust the parameters of our classifier during the
training process, and the quality of training achieved is assessed using the test
data. In the testing round, the classifier predicts the identity of the object based
on what it learned from the training data. Since the spectroscopic identities for
the test data are available, this allows us to determine the completeness and
contamination in the predicted classes. All unresolved objects from the region that
had spectroscopic confirmation were included in the test data, while a smaller
subset of about 10 per cent of it was used as the training data. The third dataset
in the group, referred to as the prediction data, was the larger dataset that
included all unresolved point sources in the region irrespective of whether or not
they had a spectroscopic confirmation. The predictions made on these data are
compared with 29 publicly available catalogues to determine the accuracy of our
predictions at brighter magnitudes. We describe the details of the training and
testing procedure in Section 4. All magnitudes used in this paper are
über-calibrated psf magnitudes described by Padmanabhan et al. (2008). The
'übercalibration', first introduced with DR6 data, improves the photometric fidelity
of SDSS data, and these magnitudes represent the most robust photometric
measurements. SDSS reports the photometric measurements in asinh magnitudes (Lupton
et al. 1999). The magnitudes used have all been corrected for Galactic extinction.

3 CLASSIFIER

We used a Difference Boosting Neural Network (DBNN) (Philip & Joseph 2000)
classifier, which is a Bayesian supervised learning algorithm. The DBNN has been
used in the past for successful star-galaxy classification (Philip et al. 2002),
galaxy morphology classification (Odewahn et al. 2002; Goderya et al. 2004) and
quasar candidate identification (Sinha et al. 2007) problems. Bayes' theorem allows
one to compute the probability for an event to occur based on some prior belief and
a likelihood for the event to be related to an observation. The prior belief is the
domain knowledge about the event and usually is the most difficult quantity to
estimate correctly. The estimation of the likelihood can also be difficult when
there is conditional dependence between the observations. For example, saying that
the colour of an object is red does not alone allow one to say what the object is.
Some additional information, related by the logical AND operation, has to be
associated with the colour to make the communication meaningful. We refer to this as
conditional dependence. In such situations, the likelihood has to be computed in
consideration of all the associated conditions, thereby making Bayesian estimation
computationally intensive.

The DBNN suggests that binning can be used to impose conditional independence on the
observations. This might appear to be an unrealistic constraint; however, it is not.
Classification inherently demands uniqueness in observations. If we bin the observed
feature sufficiently narrowly (like a histogram with small bin sizes), each bin
would capture the likelihood for the feature to occupy a certain bin location. If we
also make separate bins for each class, those will represent the likelihood for the
feature to be in a given bin for each class. Now, the value of a feature will always
impose some constraint on the possible values the other features can have. The
method works if there is a sufficiently large number of events (counts) compared to
the number of bins to give a faithful estimate of the likelihood. For this study,
our training sample has 14,356 objects and we use 61 bins for each feature. This is
not a critical number and the results would not have changed if we had used a
somewhat different number. The procedure is to start with a small number of bins and
gradually increase it until the different classes and their diversities are
adequately captured by the learning algorithm. Since there are no quantitative
measures for the adequate capture of diversities, the number of bins is often found
by trial and error as the value that maximises the prediction accuracy of the
classifier. A great advantage of the binning scheme is that conditional independence
allows the posterior Bayesian probability to be computed as the product of the
individual probabilities, which significantly reduces the computational overhead. In
addition, the binning scheme allows the classifier to have some of the advantages of
non-parametric classifiers while retaining one advantage of a parametric classifier,
namely that the effective number of features does not grow with the size of the
dataset.
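
The binning idea can be illustrated with a minimal sketch. It is not the DBNN code: it assumes equal-width bins between the minimum and maximum of each feature, a flat prior, and illustrative function names (fit_binned_likelihoods, bin_index, posterior) that are ours. The weight updates that DBNN applies to the prior are left out.

```python
import numpy as np

def fit_binned_likelihoods(X, y, n_classes, n_bins=61):
    """Histogram every feature separately for every class (assumed equal-width bins)."""
    X = np.asarray(X, dtype=float)
    y = np.asarray(y)
    lo, hi = X.min(axis=0), X.max(axis=0)
    # One row of bin edges per feature: shape (n_features, n_bins + 1).
    edges = np.linspace(lo, hi, n_bins + 1, axis=0).T
    like = np.zeros((n_classes, X.shape[1], n_bins))
    for c in range(n_classes):
        Xc = X[y == c]
        for f in range(X.shape[1]):
            counts, _ = np.histogram(Xc[:, f], bins=edges[f])
            # Fraction of class-c training objects that fall in each bin.
            like[c, f] = counts / max(len(Xc), 1)
    return edges, like

def bin_index(x, edges):
    """Bin occupied by each feature value; -1 if outside the trained range."""
    x = np.asarray(x, dtype=float)
    n_bins = edges.shape[1] - 1
    idx = np.array([np.searchsorted(edges[f], x[f], side="right") - 1
                    for f in range(len(x))])
    idx[x == edges[:, -1]] = n_bins - 1        # right edge belongs to the last bin
    idx[(idx < 0) | (idx >= n_bins)] = -1
    return idx

def posterior(x, edges, like, prior=None):
    """Normalised product of the per-feature likelihoods and the prior."""
    n_classes = like.shape[0]
    if prior is None:
        prior = np.full(n_classes, 1.0 / n_classes)   # flat prior (assumption)
    idx = bin_index(x, edges)
    if np.any(idx < 0):
        return np.zeros(n_classes)                    # never seen: no evidence
    per_feature = like[:, np.arange(len(idx)), idx]   # (n_classes, n_features)
    score = prior * per_feature.prod(axis=1)
    total = score.sum()
    return score / total if total > 0 else score
```

The product over features in posterior() is where the conditional independence assumption enters. The cost of a prediction grows with the number of features and classes rather than with the size of the training set.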

A second issue that often affects the Bayesian computation is the uncertainty of the
prior. The binning scheme further complicates the estimation of the prior, since we
now need to know the prior for the likelihood for each bin of the input feature to
compute the posterior. DBNN resolves this issue by computing the prior from the
data. In the Bayes formula, the prior appears as a multiplicative term. The DBNN
initially assumes a flat prior for all the bins. In the training phase, when a set
of features is given as input, DBNN makes a prediction about its class based on its
current prior and likelihood. If the prediction is wrong, it updates its prior,
which is called the weight, using a gradient descent algorithm. The important point
is that the gradient descent is computed based on the differences in the estimated
probabilities for the predicted class and the real class of the sample, and hence
the computation is largely devoid of fluctuations due to outliers.
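
A hedged sketch of such a training pass is given below. The exact update rule is not stated in this section, so the step used here, rate * (P_pred - P_true) / P_pred, is an assumption chosen only to show a correction driven by the difference between the two estimated probabilities. The array layout (class, feature, bin) and the function names are also illustrative.

```python
import numpy as np

def class_scores(idx, like, weights):
    """Unnormalised score per class: product over features of weight * likelihood."""
    f = np.arange(len(idx))
    return (weights[:, f, idx] * like[:, f, idx]).prod(axis=1)

def train_epoch(bin_idx, labels, like, weights, rate=0.5):
    """One pass over the training set; returns the number of wrong predictions.

    bin_idx : (n_samples, n_features) bin occupied by each feature value
    labels  : (n_samples,) true class index
    weights : (n_classes, n_features, n_bins), all ones for a flat start
    """
    errors = 0
    for idx, true in zip(bin_idx, labels):
        scores = class_scores(idx, like, weights)
        pred = int(scores.argmax())
        if pred != true and scores[pred] > 0:
            errors += 1
            # Assumed rule: the step depends only on the gap between the
            # predicted-class and true-class estimates for this sample.
            step = rate * (scores[pred] - scores[true]) / scores[pred]
            weights[true, np.arange(len(idx)), idx] += step
    return errors
```

Only the weights of the true class, in the bins occupied by a misclassified sample, are changed. Correctly classified samples leave the weights untouched, so repeated passes concentrate on the regions where classes overlap.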

In principle the classifier is able to compute the proba- 
bility for the sample to be a member of each of the classes. 
But in practical situations, the most likely class will have 
the highest confidence and the second one might have a 
good share of the remaining confidence. In our software im- 
plementation, the classifier is able to make two predictions 
along with the associated confidence it has in each of the 
predictions. This information can be extended to identify 
outliers, reduce contamination and to some extent, evaluate 
the limitations of the features. 

The Bayesian classifier we use can help us to identify unrepresented and rare
examples existing in the data. According to Bayes' theorem, the posterior
probability is the normalised product of the likelihood and the prior for an
outcome. The likelihood is the probability with which similar events have appeared
in the past. Since the likelihood for an unseen event is zero, the classifier flags
such an object as 'rejected' and does not classify it. Objects with the flag
'rejected' can be studied individually and subsequently added to the training
sample, which makes it possible to identify completely new classes of objects
efficiently.
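
The two-prediction output and the 'rejected' flag described above can be sketched as follows. The function name top_two and the return format are illustrative assumptions, not the interface of the DBNN software.

```python
import numpy as np

def top_two(posterior, labels):
    """Two most probable classes with their confidences.

    Returns the string 'rejected' when every class has zero posterior,
    i.e. when nothing similar was seen during training.
    """
    posterior = np.asarray(posterior, dtype=float)
    total = posterior.sum()
    if total <= 0:
        return "rejected"
    conf = posterior / total
    order = np.argsort(conf)[::-1][:2]      # indices of the two largest values
    return [(labels[i], float(conf[i])) for i in order]

labels = ["star", "quasar", "galaxy"]       # illustrative label set
first, second = top_two([0.02, 0.18, 0.0], labels)
```

A large gap between the first and second confidence indicates an object well inside its class, while two comparable values point to an overlap region in feature space.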

The posterior probability will be high when the likelihood and the prior are high.
Thus it is often referred to as the 'confidence' in a prediction. A high confidence
usually means that the object occupies a location in feature space that is well
within the boundary of the cluster formed by its class. However, this is not an
assurance that the object is always correctly classified. It may happen that a
negligibly small fraction of objects within that cluster belong to a different
class. Because of their small number, the likelihood for them tends to zero and it
might happen that such objects will never be correctly identified. On the flip side,
this helps the classifier to efficiently learn the boundaries of a class even when
there are outliers in the data. Since our classifier bins the data and separately
computes the likelihood for each bin, one can optimise the bin width to improve the
sensitivity of the likelihood estimates in favour of the marginally represented
samples in the data.

The training and testing procedures for our classifier are explained on the DBNN
home page^1.



4 TRAINING AND TESTING DATA 

As mentioned earlier, photometric correlations of colours with the spectral class of
objects are well established in the literature. The SDSS CAS server has a
'SpecPhoto' table that provides the five SDSS magnitudes and spectroscopic
classification of all the primary objects selected for spectroscopy by the survey.
These spectroscopic classifications are automated in the SDSS pipeline and thus, in
the case of a small fraction of the objects, the classifications are in error. A
follow-up visual verification of the classification has thus been carried out for
quasars and this is available separately on the SDSS web site as the final SDSS
quasar catalogue. In addition to our own visual examination of the spectra, we
incorporated the corrections in the SDSS DR7 final quasar catalogue (Schneider et
al. 2010) in our training data.

^1 http://www.iucaa.ernet.in/~nspp/dbnn.html

Table 1. Colour cuts used for preparing training data.

colour    Lower Limit    Upper Limit
u-g       -0.25          1.00
g-r       -0.25          0.75
r-i       -0.30          0.50
i-z       -0.30          0.50

The five bands of SDSS can give four independent colours. Sinha et al. (2007) had
shown that good accuracy on quasar classification is possible with the use of the
four independent colours and one pivot magnitude with DBNN, and our work is an
extension of their study. While their classification accuracy on the test data was
about 97 per cent, we find that the use of all the ten colours (5C2 = 10) which can
be formed from the five SDSS magnitudes, and one magnitude, can improve the accuracy
to about 99 per cent on the spectroscopically confirmed test data. There is no
additional information in the newly added correlated colours. However, finer details
of the probability distribution function (pdf) that are unresolved when only the
independent colours are used become distinct when all the colours are used. A
Bayesian classifier differentiates objects based on a likelihood that is estimated
from the pdf, and hence its resolution plays a significant role in classification.
We illustrate this in Fig. 2 by considering objects from a narrow region,
0.14 < u-g < 0.26, of one feature and plotting the distributions of the same objects
in the remaining nine features. It is seen that, despite the fact that some of the
colours are correlated, the probability distribution functions look different in
each representation. The Bayesian estimator in our classifier has made use of this,
which would be missing if we used only the independent colours, to efficiently
separate the overlapping features of objects belonging to different classes.
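
As a small illustration of how the ten colours follow from the five magnitudes, the sketch below forms every pairwise difference. The function name all_colours and the example magnitudes are hypothetical.

```python
from itertools import combinations

BANDS = ("u", "g", "r", "i", "z")

def all_colours(mags):
    """Map a dict of five psf magnitudes to the ten pairwise colours."""
    return {f"{a}-{b}": mags[a] - mags[b] for a, b in combinations(BANDS, 2)}

# Hypothetical magnitudes for a single point source.
example = {"u": 19.5, "g": 19.2, "r": 19.0, "i": 18.9, "z": 18.8}
colours = all_colours(example)
```

Only four of these differences are independent: for instance, u-r is the sum of u-g and g-r. The extra six carry no new information for a single object, but binning them separately samples the class distributions along additional directions.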

It may be noted that the resolution in the feature space increases when the bin
width is decreased. However, narrowing the bins to improve resolution in colour
space requires an ever larger number of observations in each bin for the pdfs to be
estimated. We thus restrict our analysis to regions in the feature space where
maximum spectroscopic classification is available. This is the first criterion we
had for selecting the particular window region for our study. Although it is only a
small region of the colour space, it has over 46 per cent of the available SDSS
spectroscopy in unresolved detections. The colour cuts used by us to define this
region are given in Table 1.
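
The window defined by Table 1 can be written as a simple membership test. The sketch assumes that the limits are inclusive, which the text does not state, and the function name in_colour_window is illustrative.

```python
# Limits copied from Table 1: colour -> (lower, upper).
COLOUR_CUTS = {
    "u-g": (-0.25, 1.00),
    "g-r": (-0.25, 0.75),
    "r-i": (-0.30, 0.50),
    "i-z": (-0.30, 0.50),
}

def in_colour_window(mags, cuts=COLOUR_CUTS):
    """True if all four adjacent colours lie inside the window (inclusive, assumed)."""
    for name, (lo, hi) in cuts.items():
        a, b = name.split("-")
        if not lo <= mags[a] - mags[b] <= hi:
            return False
    return True

# Hypothetical objects: a blue point source and a much redder one.
blue = {"u": 19.2, "g": 19.0, "r": 18.9, "i": 18.8, "z": 18.7}
red = {"u": 21.5, "g": 19.0, "r": 18.0, "i": 17.6, "z": 17.4}
```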

As said, we use all the ten colours, u-g, u-r, u-i, u-z, g-r, g-i, g-z, r-i, r-z,
i-z, plus the M-band psf magnitude as input features for our classifier. During the
training process, by definition, the classifier learns the correlation between
colour and spectroscopic types in the training data. This is stored and, when
similar features are presented to the classifier at a later time, it is used to
predict the likely spectroscopic classification of an object. To make comparison
easy, the predictions were assigned the same labels as used by 'SpecClass' in the
'SpecPhoto' table of SDSS (see Section 2).

Figure 2. Distribution in various colours of spectroscopically identified objects
from the region 0.14 < u-g < 0.26 in the ten dimensional colour space. The colour
code is red for quasars, blue for stars and black for galaxies.



4.1 Data issues and resolutions 

For going to fainter levels, we assume that within the window region of the selected
colour space, the likely variations in colour at fainter levels can be learned by
the classifier from the colour dispersions observed at brighter magnitudes. However,
it is possible that a fainter object is at a different redshift compared to the
bright object of its kind and that its observed spectrum, and hence colour, is
altered due to redshift. In the case of quasars, it is known that the redshift
distribution at brighter magnitudes (J < 21.2) is similar to that at fainter
magnitudes (J > 21.2) (Koo & Kron 1988). Hence, if we have spectroscopically
confirmed bright quasars from all redshift ranges in our training data, the
classifier can learn the intrinsic variations due to redshift differences and then
reliably extend this information to classify quasars at fainter magnitudes. This is
another reason for restricting our study to a small region in colour space that has
the maximum number of spectroscopically confirmed quasars.

The colour cut we used is so selected that it avoids 
most of the late type stars and faint galaxies that can come 
in as contaminants in our catalogue. However, to include 
faint stars and galaxies that might have different colours 
compared to their brighter counterparts and thus might have 
entered the colour window, we took a few representative 
faint objects that have spectroscopic confirmation of their 
class in 2dF and included them in our training sample. These 
gave us representative training samples with spectroscopic 
confirmation to 22nd magnitude in SDSS i-band. Our final 
training data thus had 14,356 unresolved spectroscopically 
confirmed objects. 

For preparing the catalogue, we selected objects that have the flag BINNED1 set and
excluded objects with flags EDGE, NOPROFILE, PEAKCENTER, SATURATED, NOTCHECKED,
PSF_FLUX_INTERP, DEBLEND_NOPEAK, BAD_COUNTS_ERROR or INTERP_CENTER^2. At fainter
levels, the SDSS magnitude error estimate becomes unreliable. Since the same
exposure time is used by SDSS for the entire frame, fewer photons are collected from
faint objects. This significantly affects the signal to noise ratio and puts an
upper limit on the faintness that can reliably be used to extract colour
information. Though not compensatory, we restricted the upper limit for magnitude
errors to ~0.4 in i and ~0.2 in g when preparing the catalogue. However, as SDSS
flags are not validated at fainter magnitudes, the level of contamination in our
catalogue due to data artefacts is highly uncertain beyond i ~ 21.3.

^2 http://www.sdss.org/dr7/products/catalogs/flags.html



4.2 Preparation of training data 

The selected region has 106,466 spectroscopically confirmed unresolved objects. For
preparing a training sample, we initially took a random set of ~10 per cent of these
data. It was found that the random sampling did not give a good representation of
the sparsely represented examples, like late type stars, of which there are only
301, or galaxies, of which there are only about 852 in our spectroscopically
confirmed data. We used the following strategy to handle this issue. When a feature
vector is presented to the classifier during the testing round, it checks whether it
has seen an example that looks similar to it earlier. If the test fails, then the
classifier will flag that object as 'rejected' without attempting classification. We
grouped such flagged objects and added representative samples from them into the
training data so that the classifier will be able to classify them. As a result, all
the under-represented examples got included in the training data.

Another issue is caused by redshift, which can make the colours of an object appear
similar to those of objects belonging to another class. Since this causes the
outcomes for the two classes to be equally likely, the Bayesian probability
estimated by the classifier for objects in such regions will be lower. Our
classifier uses this information to find regions that require extra training
samples, so that the minor differences in the features may be learned to separate
out the classes efficiently. Thus our training data has more examples from regions
in the colour space that are occupied by objects from different classes.

Another problem we observed in random sample selection was that objects that had
higher representation in the data always dominated in the training sample. This
causes the dominant class to bias the classifier in its favour. In such a case one
has to either remove the excess examples from the training data or add more examples
from the under-represented class to the training data as a compensatory measure. All
these requirements together gave us 14,356 objects with spectroscopic confirmation
as our training sample. This is composed of 3,968 stars, 9,236 quasars, 851 galaxies
and all the 301 late-type stars. Out of the 14,356 objects in the training data,
2,806 are from 2dF, which include 1,025 stars, 1,470 quasars, 278 galaxies and 33
white dwarfs with i-band psf magnitude ranging from 17 to 22 mag. As mentioned
earlier, the 2dF objects were added to improve the prediction accuracy of our
classifier at the fainter magnitudes where SDSS spectroscopy was not available.
Since the training data were constructed from the SDSS and 2dF spectroscopic data,
the object type for all of the data is known.

During the training process the Bayesian likelihood for each of the training
examples is computed and related information is stored by the classifier in its
runtime file. This information is used later when new data are presented for
classification.

To evaluate the performance of the classifier, the trained classifier is used to
predict the class of the test data. Since the class of the objects in the test
sample is known, the predictions can be easily compared. It is found that the
classifier correctly predicted 99.5 per cent of stars and 99.96 per cent of quasars.
In Table 2 we summarise the actual number of objects in the test data set, the
predicted numbers in each class and the accuracy of prediction.
Table 2. The accuracy of our classifier as compared to the SDSS DR7 spectroscopic classification of the test sample.

Object                    DBNN Predictions
Type         Star      Galaxy    Quasar    Star-Late    Completeness    Contamination
Star         18,337    -         90        -            99.51 %         0.47 %
Galaxy       27        705       120       -            82.74 %         0.00 %
Quasar       34        -         86,852    -            99.96 %         0.28 %
Star-Late    25        -         34        242          80.40 %         0.00 %
Total        18,423    705       87,096    242          99.69 %         0.31 %


In addition to the likely spectral type, the classifier also returns the computed
Bayesian posterior estimate for the prediction, which is a measure of the confidence
the classifier has in the prediction. Usually an object predicted with high
confidence is predicted correctly. Sometimes, however, an object is assigned to the
wrong class with high confidence. This can happen when the colour of the object
becomes similar to the colour of objects in another class which densely populate
that region of the colour space. The other possibility is that the assigned object
label is incorrect. The latter enabled us to find incorrect spectroscopic labelling
of quasars as galaxies in SDSS data, which we describe in the next section.



4.3 Analysis of Test Failures 

The photometric classification of objects based on colours appears to be
straightforward, but it has been observed that colours of different object types
sometimes overlap due to various reasons. In our test data there were 34 quasars
that got incorrectly classified as stars by our classifier. The locus of the colour
feature space that forms the failed cases (black dots) in Fig. 3 shows that the
colours of these objects lie mostly along the boundary of the stars and quasars. For
clarity, we did not include galaxies in the plot. What causes this overlap? Fig. 4
shows that most of the failures cluster around some specific patches of redshift at
which the apparent colours of quasars are similar to those of some dominant stellar
populations. Some of these populations, like that near z ~ 0.675, look like white
dwarfs (Richards et al. 2009a).

In addition to these, errors in extinction corrections and
intrinsic differences between objects within the same
spectroscopic class also account for the less prominent
prediction errors. If there are sufficient data, the classifier
learns this limitation during training and assigns a lower
confidence to objects belonging to such overlapping regions in
the feature space. This is shown in the lower panel of Fig. 4.
By using a confidence cut-off, such objects can be removed to
obtain a purer sample of quasars at the cost of reduced
completeness. This is one of the advantages of having a
probability estimate attached to each prediction.

Figure 3. A two dimensional projection of the feature space of
quasars (blue) and stars (green), along with quasars mistakenly
identified as stars and stars mistakenly identified as quasars
(black * marks). The failures are at the colour boundary between
quasars and stars in the ten dimensional feature space.



The SDSS spectral pipeline had sometimes labelled the objects
incorrectly in the SDSS tables. Many of these errors were later
corrected in the latest (fifth) SDSS quasar catalogue release
(Schneider et al. 2010). We initially used the classifications
from SDSS tables for training our classifier. On visual
examination of the failures, we realised that some of the
objects had been incorrectly labelled in SDSS tables and were
subsequently corrected in the DR7 final quasar catalogue (DR7Q).
We thus updated our training sample with the DR7Q
classifications and used it to produce the photometric
catalogue. However, the corrections in the training data
resulted in only very minor changes to our catalogue, less than
0.5 per cent of its previous estimates. This is because changing
a few labels in the data need not necessarily bring forth
considerable changes in the estimated probability distribution
function used by the Bayesian classifier. This advantage of our
method helps the classifier to robustly handle unseen outliers
that inevitably exist in any data.



5 THE CATALOGUE 

The catalogue is created using our trained classifier on the 
prediction data that was described earlier. The prediction 
data are similar to the training and test data with the only 
difference being that they have no known class label. 

There are 6,038,247 rows (object detections) in our catalogue.
These are classified into 2,430,625 quasars, 3,544,036 stars and
63,586 galaxies by our classifier. The distribution of the i
magnitude of the objects in our catalogue is shown in Fig. 5.
According to the classifier predictions, the distribution of
stars peaks at i ~ 20, followed by quasars and galaxies at
~ 21.5, with truncation at fainter magnitudes due to the
photometric limit of the SDSS survey in our colour box. It may
also be noted that the colour distribution of the 6 million
predictions in the catalogue looks similar to that shown in
Fig. [T]. The catalogue contains 3,412,751 (57 per cent) objects
fainter than a PSF i magnitude of 20.2, which is the limiting
magnitude of SDSS quasar spectroscopy (Richards et al. 2002),
and of these 1,999,690 are predicted as quasars, 1,352,871 as
stars and 60,190 as faint galaxies.

Figure 4. The SDSS colours of stars and quasars are
indistinguishable in a few patches of redshift. In the upper
panel, the histogram in orange indicates the correctly predicted
quasars and the histogram in black represents the distribution
of quasars that were incorrectly classified as stars (failures).
The bottom panel shows how the confidence of the classifier
falls in these patches. The black dots represent the confidence
of failed objects. It may be noted that, despite the reduced
confidence of the classifier in the region shown, most of the
quasars were correctly identified. The combined failures from
galaxies and stars together are only ~ 0.5 per cent.

It was found that 69 per cent of the objects in the catalogue
are predicted with 100 per cent confidence, while only 4 per
cent have less than 90 per cent and the remaining 27 per cent
have confidence between 90 and 100 per cent. This means that
approximately 69 per cent of the objects are well resolved in
the colour space, while the remaining 31 per cent share the
colour space with objects from different classes. Only 1 per
cent of the objects were predicted with confidence less than 70
per cent; these are from regions where there is substantial
overlap between different classes. As stated earlier, a lower
posterior prediction probability occurs when the events are
equally likely, say, when the colour of one type of object
overlaps with another in the data. A plot of the normalised
cumulative histogram of these probabilities in the catalogue is
shown in Fig. 6.

Figure 5. Overall magnitude distribution (upper panel) of
quasars, stars and galaxies in our catalogue is shown in black,
extending from 14th to 24th magnitude in the SDSS i-band. The
individual distributions show how the surface density of the
types changes. First the stars (green), then quasars (blue) and
galaxies (pink) peak in the distributions as we move towards
fainter magnitudes. In the lower panel, a 3D colour cube [u-g,
g-r, r-i] of the 6 million predictions in our catalogue, colour
coded as galaxies (larger yellow points are used to make their
locus visible), stars (green) and quasars (blue), is shown.

Objects with the same confidence value within a small hypercube
of the feature space form the most similar objects in the entire
data. One can build subsets of objects grouped on the basis of
their confidence value for follow-up studies. For example, to
study objects of a specific kind, the confidence measure given
in our catalogue for that kind can be used as an additional
dimension alongside other regular identifiers such as colour.
Table 4. Sample rows from our photometric catalogue based on SDSS DR7 (please see text for column references).

SDSS ID             R.A.          DEC          i mag      Most Probable  Confidence 1 (%)  Second Most Probable  Confidence 2 (%)
587732772667326484  185.24782587  10.87881822  18.59683   3              100.0000          1                     0.00000
587732772667326517  185.18802403  10.93259451  19.817682  1              99.99999          4                     0.00001
587732771042623616  152.67642584  8.17636451   19.76019   2              100.0000          3                     0.00000
587732771039346749  145.17463514  7.29821365   18.57126   3              100.0000          4                     0.00000
587732771039871127  146.37899605  7.54401918   19.97955   4              98.93777          3                     1.06223
587732771043016884  153.62477833  8.34608604   19.97215   1              100.0000          3                     0.00000
587732771042296010  151.99741353  8.11097072   19.46754   2              88.48774          3                     11.51226
587732771039608895  145.81816028  7.50764781   18.41648   3              100.0000          1                     0.00000
587732771047866444  164.72707514  9.09094786   19.79277   4              99.99086          3                     0.00914
587732771043410105  154.58767573  8.27519667   19.55149   3              95.53498          1                     4.46502


Figure 6. Normalised cumulative histogram of the predicted
Bayesian posterior probability (confidence) for quasars (blue),
stars (green) and galaxies (pink) in the catalogue. The
individual value attached to each prediction can be regarded as
the confidence the classifier has in that prediction. This
information may be used to subgroup objects for follow-up
studies on the basis of similarities.



Table 3. The distribution of objects in our photometric catalogue.

Class                Predicted No
Main Sequence Stars  3,540,337
Galaxy               63,586
Low Z Quasars        2,257,905
High Z Quasars       172,720
Late type Stars      3,699
Total                6,038,247



5.1 Catalogue Format 

The full catalogue contains 8 columns for 6,038,247 SDSS sources
and uses the same numeric labels as the SDSS tables to classify
objects into stars, galaxies, low redshift quasars, high
redshift quasars (HizQSO) and late-type stars. The quasars in
the catalogue include BAL quasars, BL Lacertae objects and other
AGNs. Although we assign separate labels for high and low
redshift quasars, in this first release we do not classify the
different types of AGNs and their subclasses, and group them all
under the single name quasars. A summary of the objects
predicted by the classifier is given in Table 3. A sample of 10
entries in the catalogue is given in Table 4. The content of
each column is as follows.

(i) SDSS ID : SDSS photometric object ID 

(ii) R.A. : Right ascension in decimal degrees (J2000) 

(iii) DEC. : Declination in decimal degrees (J2000) 

(iv) i mag : SDSS i-band PSF übercalibrated asinh magnitude

(v) Most Probable : Most probable class of the object,
represented by integers with the same meaning assigned by
SDSS in their tables.

(vi) Confidence 1 : The confidence (per cent) the classifier has
in the most probable class.

(vii) Second Most Probable : The second most probable class of
the object, same format as Most Probable.

(viii) Confidence 2 : The confidence (per cent) the classifier
has in the second most probable class.
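The column layout above can be read with a short sketch such as the following. Two things here are assumptions rather than part of the catalogue description: that a row is stored as eight whitespace-separated fields in the listed order, and that the integer class codes follow the SDSS DR7 SpecClass convention (1 star, 2 galaxy, 3 quasar, 4 high redshift quasar, 6 late-type star). The names `CatalogueRow`, `parse_row` and `CLASS_NAMES` are introduced only for illustration.

```python
from typing import NamedTuple

# Assumed mapping, following the SDSS DR7 SpecClass convention.
CLASS_NAMES = {
    1: "star",
    2: "galaxy",
    3: "quasar (low z)",
    4: "quasar (high z)",
    6: "late-type star",
}


class CatalogueRow(NamedTuple):
    sdss_id: int
    ra_deg: float
    dec_deg: float
    i_mag: float
    most_probable: int
    confidence_1: float
    second_most_probable: int
    confidence_2: float


def parse_row(line: str) -> CatalogueRow:
    """Parse one whitespace-separated row (assumed file layout)."""
    fields = line.split()
    if len(fields) != 8:
        raise ValueError(f"expected 8 columns, found {len(fields)}")
    return CatalogueRow(
        int(fields[0]), float(fields[1]), float(fields[2]), float(fields[3]),
        int(fields[4]), float(fields[5]), int(fields[6]), float(fields[7]),
    )


# Fifth sample row of Table 4.
sample = "587732771039871127 146.37899605 7.54401918 19.97955 4 98.93777 3 1.06223"
row = parse_row(sample)
label = CLASS_NAMES.get(row.most_probable, "unknown")
print(row.sdss_id, label, row.confidence_1)
```

The object identifier is kept as an integer because an 18-digit ID does not survive conversion to a double-precision float.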



5.2 Cross-Matching 

We carry out cross matching between our catalogue and several
other catalogues which contain a subset of objects from our
catalogue. The purpose of the cross matching is twofold:

(i) To identify potential limitations of the catalogue based on
available data. This is important because the whole exercise is
based on a single survey and has not considered any of the
inherent constraints in that survey. In many surveys the target
selection algorithms are optimised to detect a particular
category of candidates, and training data derived from such a
survey need not be representative of the kinds of objects
observed by other surveys. Cross matching can reveal such biases
if they exist.

(ii) To estimate the quality of classification at brighter
magnitudes where existing spectroscopy can provide a
representative sample. This is significant because, even with
modern technology, spectroscopic confirmation of all bright
objects is impossible and one has to adopt other methods to
determine the number density of the various types. Since
different surveys cover different sets of objects depending on
their objectives, the robustness of a method can be estimated by
cross matching its predictions with spectroscopic confirmations
from different surveys.
Table 5. Summary of the matching of our catalogue predictions with some existing catalogues.

            DBNN Predictions        Failures as per catalogue
Cat. Code   Quasar  Galaxy  Star    Quasar  Galaxy  Star    Accuracy  i mag Range  Ref
2DF         5976    238     1535    122     -       52      98%       17.0 - 22.0  1
XBH         212     -       -       -       -       -       100%      15.8 - 20.5  2
ASFS        1088    12      31      -       12      31      96%       14.5 - 22.1  3
BATCS       21      -       -       3       -       -       86%       18.1 - 20.5  4
CGRBS       265     1       -       -       1       -       100%      14.7 - 21.5  5
DLyaQ       21      -       1       -       -       1       95%       16.5 - 19.4  6
F2QZ        186     1       3       -       1       3       98%       16.6 - 21.0  7
KFQS        144     2       13      3       1       7       94%       16.8 - 20.6  8
LQAC        61504   17      267     -       17      267     100%      14.7 - 22.3  9
LQRF        60280   14      219     -       14      219     100%      14.7 - 21.7  10
BZC         249     4       2       -       4       2       98%       15.0 - 21.0  11
PCS         53      -       2       -       -       2       96%       15.1 - 18.5  12
ROSA        1134    -       1       -       -       1       100%      15.5 - 20.5  13
SQ13        65223   55      395     -       55      395     99%       14.7 - 22.8  14
SQR13       7       -       21      7       -       -       75%       16.3 - 20.3  14
DR7Q        79140   17      341     -       17      341     100%      14.9 - 21.8  15
SSSC        82      2       1171    82      2       -       93%       14.9 - 21.5  16
SSA13       5       -       1       -       -       -       83%       17.8 - 20.8  17
XMMSS       37      -       5       1       -       2       93%       14.9 - 20.7  18
SDSS/XMM    580     -       -       -       -       -       100%      15.2 - 20.5  19
RASS/2MASS  6       -       -       -       -       -       100%      15.5 - 18.4  20
CAIXA       16      -       -       -       -       -       100%      15.1 - 17.8  21
WDMB        20      -       106     20      -       -       84%       15.3 - 20.5  22
PMS         639     6       19596   639     6       -       97%       14.8 - 20.2  23
GLQ         2       -       -       -       -       -       100%      18.8 - 19.1  24

(1) Croom et al. 2009a; (2) Kelly et al. 2008; (3) Healey et al. 2007; (4) Zhang et al. 2004; (5) Healey et al. 2008;
(6) Curran et al. 2002; (7) Cirasuolo et al. 2005; (8) Maddox et al. 2008; (9) Souchay et al. 2009; (10) Andrei et al. 2009;
(11) Massaro et al. 2009; (12) Kuraszkiewicz et al. 2004; (13) Suchkov et al. 2006; (14) Veron-Cetty & Veron 2010;
(15) Schneider et al. 2010; (16) Skiff 2009; (17) Fomalont et al. 2006; (18) Watson et al. 2009; (19) Young et al. 2009;
(20) Haakonsen & Rutledge 2009; (21) Bianchi et al. 2009; (22) Heller et al. 2009; (23) Gould & Kollmeier 2004;
(24) Oguri et al. 2008.



To this end, we match our predictions with several other
catalogues which contain some of the objects in our catalogue
and summarise the results in Table 5. The table also includes
the magnitude range covered by the matched objects in each
catalogue. The list of unconfirmed predictions is given in
Table 6 and a detailed discussion of the cross-matching results
is given in Table 9. Multi-wavelength catalogues covering X-ray,
optical and radio with spectroscopically confirmed objects were
given preference in selecting catalogues for cross matching.

All cross-matching is done by matching the celestial coordinates
of the objects within 1 arcsec of their value in our catalogue.
Since the selected region is rich in quasars, we selected a few
spectroscopic surveys with confirmed quasars as reference data
for our catalogue. The surveys that we used for this are marked
in Table 5. This resulted in the identification of 90,249
spectroscopically confirmed quasars with objects in our
catalogue. We find that we had correctly classified 89,549 (~ 99
per cent) of the objects as quasars. These included 10,230 (~ 11
per cent) spectroscopically confirmed non-SDSS quasars, of which
9,887 (97 per cent) were correctly classified, while 48 of the
objects were labelled as galaxies and 295 as stars. In the 9,887
non-SDSS quasars, 5,167 were fainter than the SDSS quasar
spectroscopic limit of 20.2 and 5,036 (97 per cent) were
correctly predicted as quasars by our classifier. Comparing our
catalogue with spectroscopically identified stars from 2dF,
Skiff's catalogue of Stellar Spectral Classification, the Proper
Motion Catalogue from SDSS ∩ USNO-B (details and references in
Table 9) and the SDSS DR7 catalogue resulted in 36,645 stars in
the window region selected by us. Of these, 35,830 (98 per cent)
stars were correctly identified by our classifier. Comparison
with 746 spectroscopically confirmed galaxies from SDSS DR7 and
2dF resulted in the correct identification of 546 galaxies.
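The positional matching described above can be sketched as follows. This is a brute-force illustration under stated assumptions, not the procedure actually used: separations are computed with the haversine formula, each catalogue object is paired with its nearest reference object within the 1 arcsec radius, and the reference coordinates are hypothetical values chosen for the example. For catalogues of millions of objects a spatial index would be needed instead of the double loop.

```python
import math


def angular_separation_arcsec(ra1, dec1, ra2, dec2):
    """Great-circle separation (haversine); inputs in decimal degrees."""
    ra1, dec1, ra2, dec2 = map(math.radians, (ra1, dec1, ra2, dec2))
    sin_ddec = math.sin((dec2 - dec1) / 2.0)
    sin_dra = math.sin((ra2 - ra1) / 2.0)
    a = sin_ddec ** 2 + math.cos(dec1) * math.cos(dec2) * sin_dra ** 2
    return math.degrees(2.0 * math.asin(min(1.0, math.sqrt(a)))) * 3600.0


def cross_match(catalogue, reference, radius_arcsec=1.0):
    """Pair each catalogue position with its nearest reference position
    within the radius. Brute force, O(N * M)."""
    matches = []
    for i, (ra, dec) in enumerate(catalogue):
        best_j, best_sep = None, radius_arcsec
        for j, (rra, rdec) in enumerate(reference):
            sep = angular_separation_arcsec(ra, dec, rra, rdec)
            if sep <= best_sep:
                best_j, best_sep = j, sep
        if best_j is not None:
            matches.append((i, best_j))
    return matches


# Two positions taken from the sample rows of Table 4.
catalogue = [(185.24782587, 10.87881822), (152.67642584, 8.17636451)]
# Hypothetical reference positions: one close to the second object,
# one several arcsec away from the first.
reference = [(152.67650000, 8.17640000), (185.25000000, 10.88000000)]

pairs = cross_match(catalogue, reference)
print(pairs)
```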

Comparison with DR7Q gave 79,498 quasars, of which 79,140 were
correctly identified by our classifier. The 341 quasars that
were predicted as stars by our classifier are from a few patches
of redshift, including z ~ 0.675 and 2.3 (Richards et al. 2009a;
Schneider et al. 2007), where the colours of quasars are known
to merge with the colours of stars. These regions are identical
to what was shown in the upper panel of Fig. 4, where orange
represents correctly predicted quasars and black represents
quasars incorrectly classified as stars.

We found that our classifier could correctly recover difficult
and interesting objects, like 75 per cent of the rejected
objects in the 13th edition of the Quasar and Active Galactic
Nuclei catalogue (Veron-Cetty & Veron 2010). These objects were
earlier counted as quasars and were recently identified as
belonging to some other class.

Matching with the final spectroscopic quasar catalogue of the
2dF-SDSS LRG and QSO Survey (Croom et al. 2009a), covering an
area of 191.9 deg^2, resulted in 30,261 objects. Of these only
7,293 objects have spectroscopic confirmation from 2dF and of
those, 97.8 per cent were correctly predicted. The correctly
identified quasars in these data include two gravitationally
lensed quasars, SDSS J1216+3529 and SDSS J0832+0404, from Oguri
et al. (2008) and 2 damped Lyman alpha quasars from Ellison et
al. (2001). A distribution of the correctly identified and
failed faint quasars in our catalogue that have spectroscopic
confirmation is shown in Fig. 7.

Table 6. Unconfirmed predictions in the catalogue.

            DBNN Predictions
Cat. Code   Quasar  Galaxy  Star    Ref
CNDWF       365     4       23      25
ARC         38      -       -       26
XMM2iS      3176    27      287     27
ROSAT-FSC   24      -       9       28
XMMCOSMOS   88      -       8       29

(25) Brand et al. 2006; (26) Asian et al. 2010; (27) XMM-Newton
Survey Science Centre, C. 2008, VizieR Online Data Catalogue,
9040, 0; (28) Veron-Cetty et al. 2004; (29) Cappelluti et al. 2009.

Figure 7. The histogram of quasars in our catalogue that are
fainter than the SDSS spectroscopic magnitude limit in the
i-band and have spectroscopic confirmation from other surveys.
The black histogram shows the objects correctly identified by
our classifier and the red shows the failed ones. The counts on
the y-axis are shown in log scale for clarity.

A photometric catalogue (Richards+2009) of about 1.2 million
quasars was constructed by Richards et al. (2009a) using an
8-dimensional photometric classification scheme. They used a
Bayesian kernel density estimator to identify quasar candidates
from SDSS DR6 with a limiting magnitude of i = 21.3 and an
expected completeness of ~ 70% to type 1 quasars.

The Richards+2009 catalogue has 841,174 objects within the
colour window used for our study. We used our trained classifier
to determine the probable class of those objects and it labelled
734,803 (85 per cent) as quasars, 98,449 as stars and 7,922 as
galaxies. A histogram of the magnitudes of the matches, shown in
Fig. 8, shows the good agreement between the catalogues at
brighter magnitudes. The disagreement is mostly at fainter
magnitudes. A colour-colour plot of the classified objects is
shown in Fig. 9. To get some estimate of the quality of our
classification, we compared these predictions with existing
spectroscopic catalogues and give the completeness and
contamination in Tables 7 and 8. These tables show that our
method gives better accuracy and lower contamination in its
predictions. Given that the data are already categorised as
possible quasar candidates, a marginally higher contamination
rate for stars and galaxies is understandable. This also
explains why the number of stars in the table is much smaller
than the number of quasars.

Figure 8. The magnitude histogram of quasars from the
photometric catalogue of Richards et al. (2009a) for objects
with flag good ^ (red), compared with the DBNN predictions for
the same objects as quasars (blue), stars (green) and galaxies
(black).

Figure 9. A sample colour-colour plot of the distribution of
objects in the photometric catalogue of Richards et al. (2009a)
for objects with flag good ^ that are predicted by DBNN to be
quasars (blue), stars (green) and galaxies (pink).



5.3 Quasar Number Density 

Our catalogue has about twice as many quasars as classified by
Richards+2009. Does this mean that we are overestimating the
quasar luminosity function? One way to evaluate the reliability
of the catalogue is to compare the predicted quasar number
density in our catalogue with what has been observed. Our
catalogue of unresolved objects from i ~ 14 to 24 spans an area
of 11,663 square degrees with an expected coverage in redshift
up to 2.6. Our study is restricted to unresolved objects to
minimise the effect of the host galaxy on the colour estimates.
Croom et al. (2009b) have estimated the quasar luminosity
function at 0.4 < z < 2.6 based on the 2dF-SDSS LRG and QSO
(2SLAQ) survey, which has a redshift range similar to our
catalogue. The upper panel of Fig. 10 shows the g-band quasar
number counts for our complete catalogue, and the lower panel
compares them with the g-band quasar number counts (0.4 < z <
2.1) from the SGP (blue) and NGP (red) strips of the 2SLAQ
survey (after applying corrections for coverage, photometric and
spectroscopic completeness). It may be noted that the counts
agree at brighter magnitudes, whereas at fainter levels our
catalogue produces marginally larger counts because our redshift
window is ~ 50 per cent wider, extending to z ~ 2.6. We assume
that our completeness for redshift < 2.6 is comparable to what
has been depicted in Fig. [T] so that a simple comparison
between the two is possible. The number density of quasars
(z ≲ 2.6) as per our catalogue is ~ 116 deg^-2 at limiting
magnitude g ~ 22 and falls to ~ 54 and ~ 18 quasars deg^-2 at
g = 21 and 20 respectively. It is also noted that our number
count, which most likely consists of quasars with z < 2.6,
remains less than or equal to the redshift unbound number counts
reported by Koo & Kron (1988) at the respective magnitudes. Thus
one factor in the excess of quasars in our catalogue as compared
to Richards+2009, which is limited to i ~ 21, is the
contribution of objects from fainter magnitudes, which as such
does not contradict the earlier estimates of the quasar
luminosity function.

Table 7. A comparison of our predictions for objects with good ^
from the Richards+2009 catalogue with objects that have
spectroscopic confirmation from 2dF and other spectroscopic
surveys.

Confirmed       Quasars in Richards+2009  DBNN classification as
Spectral Class  Catalogue                 Quasar   Star     Galaxy
Quasar          94858                     94445    359      54
Star            1357                      281      1074     2
Galaxy          323                       91       22       210
Accuracy        98.26 %                   99.61 %  73.81 %  78.95 %

Table 8. Completeness and contamination details of predictions
by DBNN as per Table 7.

Object Type  Completeness %  Contamination %
Quasars      99.56           0.39
Stars        79.15           26.19
Galaxies     65.02           21.05



5.4 Contamination and Completeness 

Completeness is defined as the ratio N_cor/N_obj, where N_cor is
the number of correctly predicted objects (say quasars) and
N_obj is the actual number of objects in that class. Likewise,
contamination is defined as the ratio of incorrectly labelled
objects to the total number of objects assigned to a class. It
may also be defined as 1 - Accuracy.

Figure 10. The upper panel shows quasar number counts (Δg =
0.25, 0.15 ≲ z ≲ 2.6) over the full magnitude range of our
catalogue. In the lower panel, our predictions (black) are
compared with the observed quasar number counts in the redshift
range 0.4 to 2.1 from the 2SLAQ survey SGP (blue) and NGP (red)
strips. Both show close agreement with our catalogue. The
marginal deviations may be ignored considering the fact that our
redshift window is ~ 50 per cent wider (z < 0.4, 2.1 < z ≲ 2.6)
than the redshift coverage of the other two.
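The two definitions can be checked against the published numbers with a short sketch. The counts below are those of Table 7 (rows are the confirmed spectral class, columns the DBNN prediction); the function names and the dictionary layout are illustrative only. Completeness is evaluated along a row and contamination along a column, which reproduces the entries of Table 8.

```python
CLASSES = ("quasar", "star", "galaxy")

# confusion[true_class][predicted_class], counts from Table 7.
confusion = {
    "quasar": {"quasar": 94445, "star": 359, "galaxy": 54},
    "star": {"quasar": 281, "star": 1074, "galaxy": 2},
    "galaxy": {"quasar": 91, "star": 22, "galaxy": 210},
}


def completeness(matrix, cls):
    """N_cor / N_obj: correctly predicted objects over all true members."""
    n_obj = sum(matrix[cls].values())
    return matrix[cls][cls] / n_obj


def contamination(matrix, cls):
    """1 - accuracy among all objects predicted to be in the class."""
    n_pred = sum(matrix[true][cls] for true in matrix)
    return 1.0 - matrix[cls][cls] / n_pred


for cls in CLASSES:
    print(f"{cls:7s} completeness {100 * completeness(confusion, cls):6.2f} %"
          f"  contamination {100 * contamination(confusion, cls):6.2f} %")
```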

An ideal situation would be one where the completeness is 100
per cent and the contamination is zero. However, this is not
possible in reality. We have discussed the difficulties in the
classification of some of the objects, and a reasonable solution
would be one where the sample is 'complete' beyond a certain
threshold and the contamination is minimal. The classifier
assigns a confidence value to every prediction it makes, which
may be meaningfully used for this purpose.

An important point to consider here is the scientific goal at
the end of the classification process. If we want a high level
of completeness, then we would keep a low cut-off value for the
confidence, which can have any value between 1/N and 1, N being
the number of predefined classes to be isolated. However, if we
want to target quasars for spectroscopy, we might want quasar
candidates that have a high chance of being quasars. The
selection of a suitable lower cut-off based on the confidence
value allows one to do this. As may be noted from the cumulative
confidence plot (Fig. 6), a value of 55 per cent confidence
could be regarded as a good cut-off for most purposes. However,
this may also result in the loss of quasars at some specific
patches of redshift where the confidence value drops because of
the merging of colours from different classes. This is the
inevitable trade-off between completeness and contamination.
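The trade-off can be made concrete with a toy example. The eight predictions below are hypothetical and are not drawn from the catalogue, N = 3 classes is assumed for the lower bound 1/N, and the function names are illustrative. The sketch applies three cut-off values to the confidence and reports the completeness and contamination of the resulting quasar sample.

```python
def apply_cutoff(predictions, cutoff):
    """Keep predictions whose confidence (0 to 1) is at least `cutoff`."""
    return [p for p in predictions if p[1] >= cutoff]


def quasar_sample_quality(predictions, cutoff):
    """Completeness and contamination of the quasar sample after the cut."""
    n_true = sum(1 for _, _, true in predictions if true == "quasar")
    kept = [p for p in apply_cutoff(predictions, cutoff) if p[0] == "quasar"]
    n_correct = sum(1 for _, _, true in kept if true == "quasar")
    completeness = n_correct / n_true if n_true else 0.0
    contamination = 1.0 - n_correct / len(kept) if kept else 0.0
    return completeness, contamination


# Hypothetical (predicted class, confidence, true class) triples.
toy = [
    ("quasar", 1.00, "quasar"),
    ("quasar", 0.98, "quasar"),
    ("quasar", 0.90, "quasar"),
    ("quasar", 0.60, "star"),
    ("quasar", 0.52, "quasar"),
    ("quasar", 0.40, "star"),
    ("star", 0.95, "star"),
    ("star", 0.58, "quasar"),
]

N_CLASSES = 3  # assumed: stars, galaxies and quasars
for cutoff in (1.0 / N_CLASSES, 0.55, 0.95):
    comp, cont = quasar_sample_quality(toy, cutoff)
    print(f"cut-off {cutoff:.2f}: completeness {comp:.2f}, contamination {cont:.2f}")
```

Raising the cut-off removes the low-confidence contaminants first, but it also discards genuine quasars whose confidence is low, so completeness falls as contamination falls.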



5.5 Future Plans 

Photometric estimation of redshift will significantly increase
the usability of our catalogue by adding a further dimension to
the distributions of the objects. Secondly, the colour cuts that
we used might have left out many interesting objects. Can we
meaningfully classify objects from regions where there are fewer
training samples? One option is to go to multi-band observations
with capabilities to handle missing attributes. Many such
objects are of specific interest for target selection in
astronomical observations, such as variability studies. These
and related issues are now being investigated.
6 SUMMARY

In this paper, we develop a machine learning algorithm based on
Bayes' theorem and train it on the colours of spectroscopically
confirmed objects from SDSS to produce a catalogue of over 6
million unresolved photometric detections in the SDSS DR7,
classifying them into stars, galaxies and quasars. These objects
are from a small region of the SDSS colour space that has
106,466 spectroscopically confirmed point sources and about 6
million photometric detections without spectroscopy, dominated
by quasars and main sequence stars. We go to the limiting
magnitudes of SDSS photometry and predict the class of the
objects with a set of logically derived constraints. Our
predictions are compared with other deep sky surveys in X-ray,
optical, infra-red and radio, and the results indicate that the
method produces a reliable classification of faint objects using
only the five SDSS magnitudes. The full catalogue and the data
are available in electronic form. The high accuracy of matching
and the ability to go to fainter magnitudes are expected to make
our classifier a valuable addition to photometric classification
and candidate identification for some of the upcoming deep sky
surveys.

The catalogue is limited to the colour window used for this
study and hence the completeness of the catalogue only refers to
objects within this window. We have noted that many of the
failures have occurred at specific patches of redshift, in
agreement with the literature. However, we wish to note that the
true nature of objects beyond i ~ 21.3 is unconfirmed and that
artefacts in the data at fainter magnitudes could have resulted
in inaccurate predictions by our classifier. The predictions
give only an approximate estimate of the possible distributions
at fainter magnitudes, which may be confirmed and periodically
improved when newer surveys spectroscopically confirm the nature
of those objects.
The SQL query used to download the data from SDSS:

SELECT
    s.objID as ObjID,
    s.ra as Ra,
    s.dec as Dec,
    -- extinction-corrected ubercalibrated PSF magnitudes
    (u.psfMag_u - s.extinction_u) as psfMag_u,
    (u.psfMag_g - s.extinction_g) as psfMag_g,
    (u.psfMag_r - s.extinction_r) as psfMag_r,
    (u.psfMag_i - s.extinction_i) as psfMag_i,
    (u.psfMag_z - s.extinction_z) as psfMag_z,
    s.type as type,
    s.z as RedShift,
    s.SpecClass as SpecClass
FROM
    UberCal u,
    SpecPhoto s into SpecUberMag
WHERE
    -- unresolved (point-like) sources only
    u.objID=s.objID and s.type=6
    -- colour window: u-g, g-r, r-i, i-z
    AND
    ((u.psfMag_u - s.extinction_u) -
     (u.psfMag_g - s.extinction_g)) between -0.25 and 1.00
    AND
    ((u.psfMag_g - s.extinction_g) -
     (u.psfMag_r - s.extinction_r)) between -0.25 and 0.75
    AND
    ((u.psfMag_r - s.extinction_r) -
     (u.psfMag_i - s.extinction_i)) between -0.30 and 0.50
    AND
    ((u.psfMag_i - s.extinction_i) -
     (u.psfMag_z - s.extinction_z)) between -0.30 and 0.50
    -- photometric quality flags
    AND ((flags & 0x10000000) != 0)
    AND ((flags & 0x8100000c00a4) = 0)
    AND (((flags & 0x400000000000) = 0) or (psfmagerr_g <= 0.2))
    AND (((flags & 0x100000000000) = 0) or (flags & 0x1000) = 0)



14 Sheelu Abraham et al. 



Table 9: Detailed description of the cross-matching results and the cat- 
alogues used 



Cat. Code Remarks 



2DF 



XBH* 



ASFS 



BATCS 

CGRBS** 

DLyaQ** 
F2QZ** 

KFQS 

LQAC 
LQRF 

BZC** 

PCS** 

ROSA** 
SQ13** 

SQR13 

DR7Q** 



The 2dF-SDSS LRG and QSO survey (Croom et al. 2009a ) is a spectroscopic quasar catalogue 
which covers an area of 191.9 deg^ There are 30,261 objects in 2dF catalogue that were 
overlapping with the colour space selected for our study. Of these, only 5510 2dF objects 
have spectroscopic confirmation. In that 4247 objects were predicted as quasars, 
1113 objects as stars and remaining 238 objects were predicted as galaxies by our 
algorithm. Thus the overall accuracy is 98 per cent. 

This is the catalogue of 318 radio-quiet and X-ray emitting quasars (RQQ) studied by 
Kellv et al. (2008 ). Of 318 detections, 212 in the colour region investigated by us. 
All of them were correctly predicted by our classifier. 

The CRATES: All-Sky Survey of Flat-Spectrum Radio Sources (Healev et al. 2007) has 
14,467 sources of which 1131 overlap with the region of our analysis. These 
objects are characterized by a flat radio spectra with high variability in optical, significant 
polarization and bimodal synchrotron/Compton spectral energy distributions. Hence 
all of them are believed to be AGNs viewed 'pole-on'. The classifier correctly identified 
1088 of them as quasars while it predicted 31 as stars and 12 as galaxies. 

Zhang et al. (2004) list the optical counterparts of 157 X-ray sources selected using the multicolour 
CCD imaging observations made by the Beijing- Arizona- Taiwan-Connecticut Sky Survey. Of these, 
21 fall in the region of our catalogue and all are predicted as quasars while 
three of these objects were identified as star burst galaxies. 

CGRaBS is an all-sky gamma-ray Blazar candidates survey fHealev et al. 2008 ) selected by 
their fiat radio spectra. Of the 1625 target observations, 266 are in our catalogue. 265 of 
these were correctly classified as quasars while one got classified as galaxy. 

Curran et al. (2002) give a catalogue of 322 damped Lyman alpha absorbers. Of these 22 appear 
in our catalogue and 21 of them are correctly identified as quasars while one failed as a star. 
Cirasuolo et al. (2005 ) gives a sample of faint radio-loud quasars from FIRST. The sample 
has 238 detections of which 190 appear in our catalogue. 186 of them are correctly 
identified as quasars and 3 failed as stars and one as a galaxy. 

Maddox et al. (2008) have created a catalogue of 3154 if-band detections of possible quasar 
candidates and their spectroscopic classifications. 159 of these objects appear in our 
catalogue while three of them have no spectral class. 144 of them were labelled as quasars. 
Two was predicted as galaxy and 13 as stars. 

Souchay et al. (2009) have constructed a large quasar astrometric catalogue of 113666 quasars. 
Of these, 61,788 overlap with our catalogue. Our classifier correctly 
identified 61,504 of them as quasars, while predicting 267 as stars and 17 as galaxies. 

Andrei et al. (2009) have constructed a large quasar reference frame of 100165 
quasars observed in different surveys. Of these, 60,513 fall in the region of our 
catalogue. The classifier correctly identified 60,280 of these objects 
as quasars, while 219 were labelled as stars and 14 as galaxies. 

Roma-BZCAT (Massaro et al. 2009) is a catalogue of 2837 blazars, of which 255 have photometric 
detections by SDSS in the colour space of our catalogue. The classifier correctly identified 249 of 
them as quasars, while 4 were incorrectly labelled as galaxies and 2 as stars. 

Kuraszkiewicz et al. (2004) provide a catalogue of 220 spectroscopically confirmed AGNs observed using the 
Faint Object Spectrograph on the Hubble Space Telescope. Of these, 55 i-band bright objects 
appear in our catalogue; 53 are correctly identified as quasars, while 2 are incorrectly labelled as stars. 

Suchkov et al. (2006) give 1744 type 1 AGNs that have X-ray observations in ROSAT PSPC. 
Of these, 1135 are present in our catalogue. All of these except one were correctly classified as quasars. 

Quasars and Active Galactic Nuclei (13th Ed.) (Véron-Cetty & Véron 2010) is a catalogue of 
168,941 AGNs (all known prior to July 1st, 2009). 65,673 of these objects have entries 
in our catalogue, of which 65,223 were correctly identified as quasars, while 395 were labelled 
as stars and 55 as galaxies. 

Rejected Quasars and Active Galactic Nuclei (13th Ed.) (Véron-Cetty & Véron 2010) 
has 178 entries that were previously considered to be quasars and were later rejected, mostly as stars. 
Of these, 28 objects have entries in our catalogue. Our algorithm incorrectly classified 7 of 
them as quasars, while 21 were correctly labelled as stars. 

Schneider et al. (2010) give the fifth edition of the SDSS quasar catalogue, consisting of 105,783 
spectroscopically confirmed quasar candidates. Of these, the colour space covered by our catalogue 
contains 79,498 quasars. Our classifier correctly identified 79,140 of them as quasars, while it labelled 
17 as galaxies and 341 as stars. 
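
As an illustration of how the recovered fractions implied by these counts can be computed, the following minimal Python sketch uses the matched totals and the numbers labelled as quasars quoted in the text for a few of the larger quasar catalogues. The dictionary layout and the function name `recovered_fraction` are illustrative assumptions and are not part of the original analysis.

```python
# Counts quoted in the text: (objects matched in the catalogue, labelled as quasars).
# The dictionary layout and the function name are illustrative assumptions.
matches = {
    "CRATES": (1131, 1088),
    "Souchay et al. (2009)": (61788, 61504),
    "Andrei et al. (2009)": (60513, 60280),
    "Schneider et al. (2010)": (79498, 79140),
}


def recovered_fraction(n_matched, n_quasar):
    """Fraction of matched objects that the classifier labelled as quasars."""
    if n_matched <= 0:
        raise ValueError("n_matched must be positive")
    return n_quasar / n_matched


for name, (n_matched, n_quasar) in matches.items():
    fraction = recovered_fraction(n_matched, n_quasar)
    print(f"{name}: {100 * fraction:.2f} per cent labelled as quasars")
```

These fractions describe only the objects that fall inside the colour window of the catalogue; they say nothing about sources that were never matched.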

The Skiff catalogue of Stellar Spectral Classifications (Skiff 2009) is a compilation of 423055 
stellar objects from the literature. Of these, 1255 have entries in our catalogue. Our classifier 
labelled 82 objects as quasars, of which 2 have been identified as quasars by the SDSS DR7 
quasar catalogue. Two objects were labelled as galaxies, while the remaining 1171 were 
correctly identified as stars. 

Fomalont et al. (2006) prepared a radio/optical catalogue of the SSA 13 field that has 878 
radio sources, of which only 6 have entries in our catalogue. All the objects except one were correctly 
identified by our classifier. 

The second XMM-Newton Serendipitous Source catalogue (Watson et al. 2009) has 3504 point 
sources, of which 42 are in our catalogue. 37 of them were predicted as quasars, 
one of which is an emission-line galaxy, and 5 as stars. 

Young et al. (2009) give the optical quasar candidates of 792 X-ray sources observed 
serendipitously with XMM-Newton. 580 of these objects appear in our catalogue 
and all of them were correctly identified by the classifier. 

Haakonsen & Rutledge (2009) give an associated catalogue of 18,806 X-ray sources in RASS/BSC 
that have near-infrared counterparts in 2MASS/PSC. Of these, 6 objects appear 
in our catalogue and all are labelled as quasars by our classifier. 

The catalogue of AGN in the XMM-Newton archive (Bianchi et al. 2009) has 156 
radio-quiet, X-ray unobscured AGNs, of which 16 appear in our catalogue. All of them 
were classified as quasars. 

Heller et al. (2009) give 857 white dwarf-M binaries from SDSS DR6, of which 126 are present 
in our catalogue. 106 of them were labelled as stars, while 20 were labelled as quasars. 
One object predicted as a quasar is a confirmed quasar in the SDSS DR7 quasar catalogue. 
White dwarfs are known contaminants in photometric quasar catalogues, 
which explains the relatively low classification accuracy of 84 per cent in this case. 
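
The 84 per cent figure follows directly from the counts above. A minimal Python sketch of the arithmetic is given below; the helper name `class_fractions` and the dictionary keys are hypothetical and serve only to illustrate the calculation.

```python
def class_fractions(counts):
    """Return the fraction of matched objects assigned to each class."""
    total = sum(counts.values())
    if total == 0:
        raise ValueError("no matched objects")
    return {label: n / total for label, n in counts.items()}


# Counts quoted above for the white dwarf-M binaries (126 matched objects).
wdmb = {"star": 106, "quasar": 20}
fractions = class_fractions(wdmb)
print(f"labelled as stars: {100 * fractions['star']:.0f} per cent")
```

Here 106 of the 126 matched binaries are labelled as stars, which rounds to the 84 per cent quoted in the text.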

Gould & Kollmeier (2004) prepared a catalogue of proper motions of 390,476 stars from SDSS and 
USNO-B observations. 20,241 objects from it are present in our catalogue. Our classifier labelled 
19,596 of the objects as stars, 639 as quasars and 6 as galaxies. Of the 639 predicted quasars, 
18 are confirmed quasars. 
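
For each catalogue, the per-class counts should add up to the number of matched objects. The short Python sketch below performs that check for the counts quoted above; the function name `counts_are_consistent` is a hypothetical helper introduced only for illustration.

```python
def counts_are_consistent(n_matched, class_counts):
    """True when the per-class counts sum to the number of matched objects."""
    return sum(class_counts.values()) == n_matched


# Counts quoted above for the proper-motion catalogue (20,241 matched objects).
pms = {"star": 19596, "quasar": 639, "galaxy": 6}
print(counts_are_consistent(20241, pms))
```

The same check can be applied to any of the other catalogues by substituting the matched total and the three class counts.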

Oguri et al. (2008) give 4 gravitationally lensed quasars from the SDSS Quasar Lens Search, 
which is a systematic survey of lensed quasars among SDSS spectroscopic quasars. Of these, 
two fall in the region of our catalogue and both are correctly predicted by the classifier. 

The Chandra XBootes Survey optical counterpart catalogue (Brand et al. 2006) has 5318 point 
sources, of which 392 are in our catalogue. The true class of these objects is not known. The 
classifier labelled 365 as quasars, 23 as stars and 4 as galaxies. 

Astrometric positions of radio sources (Aslan et al. 2010) gives the positions of the 
extragalactic radio detections of about 300 objects. Of these, 38 are in our catalogue. All of them 
were classified as quasars. 

The XMM-Newton Second Incremental Source catalogue 
gives 221,012 X-ray sources. Of these, 3,490 overlap with our catalogue. Our classifier 
identified 3,176 objects as quasars, 287 as stars and 27 as galaxies. 

Véron-Cetty et al. (2004) give optically selected bright AGN samples in the ROSAT Faint Source 
catalogue. Of the 103 quasar candidates in it, 33 are present in our catalogue. 24 objects were 
predicted as quasars by the classifier and 9 as stars. 

The XMM-Newton wide-field survey in the COSMOS field (Cappelluti et al. 2009) gives 
1887 point-like X-ray sources, of which 96 are present in our catalogue. Of these objects, 88 
were predicted as quasars and 8 as stars. 